%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Basic proposal for bioinformatics 2012
%% Edits: 
%%	m.s. -- edited with crappy windoze
%%	c.w. -- used editor of the beast VI VI VI
%%   m.s. -- fixing chris' typos caused by not being able to fully harness the beast VI VI VI
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\documentclass[10pt,a4paper,oneside]{report}
\usepackage[margin=0.75in]{geometry}

\begin{document}
\title{CAP 6545 Bioinformatics Proposal}
\author{Michael Semeniuk, Albert Steppi, and  Christopher Wolas}
\maketitle
% Set mood for the paper, dim the lights put on Barry White...or Randy Newman to slick the groove and make it weird.
This research will analyze the results of gene expression profiling. Genomic sequencing reveals every behavior a cell is capable of, whether or not that behavior is expressed. Expression profiling, by contrast, reveals only the observable behavior of a cell, which makes it possible to observe the upregulation and downregulation of particular cellular components.

% Enough fluffing, explain the selection of Problem
The research question we will explore extends the paper ``Mining Gene Expression Profiles: An Integrated Implementation of Kernel Principal Component Analysis and Singular Value Decomposition'' \cite{Reverter2010200}. In that work, the authors implemented a novel algorithm called KPCA-Biplot. The algorithm combines Kernel Principal Component Analysis (KPCA) with Singular Value Decomposition biplots (SVD-Biplot) in order to plot and analyze microarray samples and genes simultaneously and to cluster the sample types visually.

Principal Component Analysis (PCA) identifies similarities in high-dimensional data. In their work, the authors used PCA to find correlations between genes and microarray samples. SVD was employed to find these similarities in the same $n$-dimensional space \cite{Reverter2010200}.

The algorithm can be summarized as follows \cite{Reverter2010200}:

\begin{enumerate}  \itemsep -2pt % squeeze the numbers together. Way too much spacing for my liking
\item
% Added the first step since we'll have to do it
Preprocess the gene expression data using techniques such as normalization and gene centering.
\item 
Perform a standard SVD of the processed gene expression data, written in the form $\mathbf{X} = \mathbf{G}\mathbf{H}^{T}$.
\item
Compute the kernel matrix, $\mathbf{K}$, from the rows of $\mathbf{H}$.
\item
Perform KPCA on the kernel matrix found above in order to extract the nonlinear features found in the gene expression data.
% Added this step because I think it is implicit but is worth noting
\item
Select the leading eigenvectors of $\mathbf{K}$.
\item
Project the rows of $\mathbf{G}$ onto the subspace of the chosen eigenvectors of $\mathbf{K}$.
% Added this for clarity of what we do after projection
\item
Plot the genes and microarray samples on a centered biplot.
\end{enumerate}
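
% Editorial sketch of the steps above. All notation beyond X, G, H, and K
% (p, n, r, q, c, D, U, V, k, sigma, a_k, lambda_k) is assumed for illustration.
The steps above can be sketched in equations. This is a minimal formulation under stated assumptions, not necessarily the exact procedure of \cite{Reverter2010200}: the dimensions $p$, $n$, $r$, and $q$, the exponent $c$, the kernel function $k$ with its Gaussian example of width $\sigma$, and the symbols $\mathbf{a}_k$ and $\lambda_k$ are notation introduced here. Let $\mathbf{X}$ be the $p \times n$ preprocessed matrix with genes as rows and samples as columns, and let $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^{T}$ be its reduced SVD, where $\mathbf{D}$ is the $r \times r$ diagonal matrix of nonzero singular values. A biplot factorization takes, for some $0 \le c \le 1$,
\[
\mathbf{G} = \mathbf{U}\mathbf{D}^{c}, \qquad \mathbf{H} = \mathbf{V}\mathbf{D}^{1-c}, \qquad \mbox{so that} \qquad \mathbf{G}\mathbf{H}^{T} = \mathbf{U}\mathbf{D}\mathbf{V}^{T} = \mathbf{X}.
\]
Writing $\mathbf{h}_1, \dots, \mathbf{h}_n$ for the rows of $\mathbf{H}$, the kernel matrix has entries
\[
K_{ij} = k(\mathbf{h}_i, \mathbf{h}_j), \qquad \mbox{for example} \qquad k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^{2}}{2\sigma^{2}}\right).
\]
KPCA centers the kernel matrix in feature space and solves an eigenproblem,
\[
\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_n\mathbf{K} - \mathbf{K}\mathbf{1}_n + \mathbf{1}_n\mathbf{K}\mathbf{1}_n, \qquad \tilde{\mathbf{K}}\mathbf{a}_k = \lambda_k \mathbf{a}_k, \qquad \lambda_k\,\mathbf{a}_k^{T}\mathbf{a}_k = 1,
\]
where $\mathbf{1}_n$ is the $n \times n$ matrix with every entry equal to $1/n$ and $\lambda_1 \ge \lambda_2 \ge \cdots$. Keeping the $q$ leading eigenvectors, a row $\mathbf{g}$ of $\mathbf{G}$ receives the coordinates
\[
y_k(\mathbf{g}) = \sum_{i=1}^{n} a_{ki}\,\tilde{k}(\mathbf{h}_i, \mathbf{g}), \qquad k = 1, \dots, q,
\]
where $a_{ki}$ is the $i$th entry of $\mathbf{a}_k$ and $\tilde{k}$ denotes the kernel after the same centering. The rows of $\mathbf{G}$ and $\mathbf{H}$ both have length $r$, which is what allows the kernel to be evaluated between a gene and a sample.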

In the literature, a multitude of machine learning techniques have been applied to this class of problem, including clustering, self-organizing maps (SOMs), projection pursuit, hierarchical clustering analysis, and boosting. Reverter et al.\ used a novel technique to exploit linear classification in higher-dimensional spaces.

% Software Details
Our work will apply the authors' novel technique to other datasets, with the aim of obtaining similar results. The authors of ``Projection Based Clustering of Gene Expression Data'' \cite{TasoulisPT09} have made their datasets publicly available. The experiments of Reverter et al.\ considered only a subset of this data.

The data contain gene expression measurements from individuals with medical conditions and from healthy individuals. In overview, the collection includes the following: the COLON dataset (colon tumors), the LYMPHOMA dataset (three lymphoid malignancy types), the PROSTATE dataset (prostate tumors), and the ALL dataset (acute lymphoblastic leukemia).

We will analyze the remaining datasets with the novel KPCA technique proposed by Reverter et al. As a baseline comparison, we propose to analyze the same datasets using support vector machines (SVMs). We believe this approach is warranted because SVMs also make use of the kernel trick, which allows a more direct comparison between kernel-based methods.
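
% Editorial sketch: standard binary kernel SVM decision rule.
% The symbols x_i, y_i, alpha_i, b, and k are assumed notation, not taken from the cited papers.
As a brief sketch of how the kernel trick enters the baseline, a binary kernel SVM trained on labeled samples $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$ classifies a new sample $\mathbf{x}$ by
\[
f(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i\, y_i\, k(\mathbf{x}_i, \mathbf{x}) + b \right),
\]
where the coefficients $\alpha_i \ge 0$ and the bias $b$ are learned during training. The data enter only through kernel evaluations $k(\mathbf{x}_i, \mathbf{x})$, just as in KPCA, so the same kernel function could in principle be used for both methods. The symbols here are standard SVM notation introduced for illustration; datasets with more than two classes, such as LYMPHOMA, would presumably need a multi-class extension such as one-versus-rest.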

% Project Member details
The following breakdown describes the group roles. Because of the size of the project, each member will be involved in implementing the techniques and researching the literature. The particular roles listed below are assigned based on previous skills and education. However, the group intends for each member to engage in all areas of the research in order to learn all of the techniques. The implementation of KPCA and SVD will be distributed among the members.
\begin{enumerate}  \itemsep -2pt % squeeze the numbers together. Way too much spacing for my liking
\item
{\bf Michael Semeniuk:} Implementation, Problem Specification, and Bioinformatics approaches
\item
{\bf Albert Steppi:} Implementation, Mathematics, and Statistical Methods
\item
{\bf Christopher Wolas:} Implementation, Statistical Analysis, and Machine Learning  
\end{enumerate}

% We need to keep the \bibliographystyle... unless the bibliography doesn't display. At least for me...
\bibliographystyle{plain}	% (uses file "plain.bst")
\bibliography{cited}		% expects file "cited.bib"

\end{document}